Goto

Collaborating Authors

 incoming weight



A Missing lemmas for the proof of Theorem 3.1

Neural Information Processing Systems

The following proof is from Daniely and V ardi [15], and we give it here for completeness. By Lemma A.1, there exists a DNF formula We construct such an affine layer in Lemma A.2. At least one of the k size-n slices in z contains 0 more than once. We define the outputs of our affine layer as follows. Pr [z represents a hyperedge ] = n (n 1) ... (n k + 1) null 1 n null Pr null z Z null 1 2 log(n) .





The Recurrent Cascade-Correlation Architecture

Neural Information Processing Systems

Recurrent Cascade-Correlation CRCC) is a recurrent version of the Cascade(cid:173) Correlation learning architecture of Fah I man and Lebiere [Fahlman, 1990]. RCC can learn from examples to map a sequence of inputs into a desired sequence of outputs. New hidden units with recurrent connections are added to the network as needed during training. In effect, the network builds up a finite-state machine tailored specifically for the current problem. RCC retains the advantages of Cascade-Correlation: fast learning, good generalization, automatic construction of a near-minimal multi-layered network, and incremental training.


On Connectivity of Solutions in Deep Learning: The Role of Over-parameterization and Feature Quality

arXiv.org Machine Learning

It has been empirically observed that, in deep neural networks, the solutions found by stochastic gradient descent from different random initializations can be often connected by a path with low loss. Recent works have shed light on this intriguing phenomenon by assuming either the over-parameterization of the network or the dropout stability of the solutions. In this paper, we reconcile these two views and present a novel condition for ensuring the connectivity of two arbitrary points in parameter space. This condition is provably milder than dropout stability, and it provides a connection between the problem of finding low-loss paths and the memorization capacity of neural nets. This last point brings about a trade-off between the quality of features at each layer and the over-parameterization of the network. As an extreme example of this trade-off, we show that (i) if subsets of features at each layer are linearly separable, then almost no over-parameterization is needed, and (ii) under generic assumptions on the features at each layer, it suffices that the last two hidden layers have $\Omega(\sqrt{N})$ neurons, $N$ being the number of samples. Finally, we provide experimental evidence demonstrating that the presented condition is satisfied in practical settings even when dropout stability does not hold.


Interpreting a Penalty as the Influence of a Bayesian Prior

arXiv.org Machine Learning

For instance, penalties are used to improve generalization, prune neurons or reduce the rank of tensors of weights. Therefore, usual penalties are mostly empirical and user-defined, and integrated to the loss as follows: L( w) null( w) r (w), with w the vector of all parameters in the network, null( w) the error term and r (w) the penalty term. From a Bayesian point of view, optimizing such a loss L is equivalent to finding the Maximum A Posteriori (MAP) of the parameters w given the training data and a prior ฮฑ exp( r). Indeed, assuming that the loss null is a log-likelihood loss, namely, null(w) ln p w( D) with dataset D, then minimizing L is equivalent to minimizing L MAP(w) ln p w(D) ln(ฮฑ (w)). Thus, within the MAP framework, we can interpret the penalty term r as the influence of a prior ฮฑ [14]. However, the MAP approximates the Bayesian posterior very roughly, by taking its maximum. Variational Inference (VI) provides a variational posterior distribution rather than a single value, hopefully representing the Bayesian posterior much better. VI looks for the best posterior approximation within a family ฮฒ u(w) of approximate posteriors over w, parameterized Inria, Team TAU, Gif-sur-Yvette, France โ€  Facebook, France 1 arXiv:2002.00178v1


Multi-Task Zipping via Layer-wise Neuron Sharing

Neural Information Processing Systems

Future mobile devices are anticipated to perceive, understand and react to the world on their own by running multiple correlated deep neural networks on-device. Yet the complexity of these neural networks needs to be trimmed down both within-model and cross-model to fit in mobile storage and memory. Previous studies focus on squeezing the redundancy within a single neural network. In this work, we aim to reduce the redundancy across multiple models. We propose Multi-Task Zipping (MTZ), a framework to automatically merge correlated, pre-trained deep neural networks for cross-model compression. Central in MTZ is a layer-wise neuron sharing and incoming weight updating scheme that induces a minimal change in the error function. MTZ inherits information from each model and demands light retraining to re-boost the accuracy of individual tasks. Evaluations show that MTZ is able to fully merge the hidden layers of two VGG-16 networks with a 3.18% increase in the test error averaged on ImageNet and CelebA, or share 39.61% parameters between the two networks with <0.5% increase in the test errors for both tasks. The number of iterations to retrain the combined network is at least 17.8 times lower than that of training a single VGG-16 network. Moreover, experiments show that MTZ is also able to effectively merge multiple residual networks.


Multi-Task Zipping via Layer-wise Neuron Sharing

Neural Information Processing Systems

Future mobile devices are anticipated to perceive, understand and react to the world on their own by running multiple correlated deep neural networks on-device. Yet the complexity of these neural networks needs to be trimmed down both within-model and cross-model to fit in mobile storage and memory. Previous studies focus on squeezing the redundancy within a single neural network. In this work, we aim to reduce the redundancy across multiple models. We propose Multi-Task Zipping (MTZ), a framework to automatically merge correlated, pre-trained deep neural networks for cross-model compression. Central in MTZ is a layer-wise neuron sharing and incoming weight updating scheme that induces a minimal change in the error function. MTZ inherits information from each model and demands light retraining to re-boost the accuracy of individual tasks. Evaluations show that MTZ is able to fully merge the hidden layers of two VGG-16 networks with a 3.18% increase in the test error averaged on ImageNet and CelebA, or share 39.61% parameters between the two networks with <0.5% increase in the test errors for both tasks. The number of iterations to retrain the combined network is at least 17.8 times lower than that of training a single VGG-16 network. Moreover, experiments show that MTZ is also able to effectively merge multiple residual networks.